-
-
Notifications
You must be signed in to change notification settings - Fork 70
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Alternative PR to add duplicate marking to GroupReadsByUmi #940
Conversation
…etection strategies. Set duplicate flag as needed on all reads inside Template which are not the top scoring read set.
Tests could still use some reformatting to follow convention of other code. |
Codecov ReportAll modified and coverable lines are covered by tests ✅
Additional details and impacted files@@ Coverage Diff @@
## main #940 +/- ##
==========================================
+ Coverage 95.60% 95.65% +0.05%
==========================================
Files 126 126
Lines 7305 7460 +155
Branches 517 526 +9
==========================================
+ Hits 6984 7136 +152
- Misses 321 324 +3
Flags with carried forward coverage won't be shown. Click here to find out more.
☔ View full report in Codecov by Sentry. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A few things I'm confused about, but looking really good!
|chromosomes and `--allow-inter-contig` has been set to false. | ||
|read has mapping quality < `min-map-q` (default=1), c) the primary mappings for R1 and R2 are on different | ||
|chromosomes and `--allow-inter-contig` has been set to false., or d.) all non-primary reads are filtered, | ||
|if `--includeSecondary` and\or `--includeSupplementary` are set to false (default=false) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
|if `--includeSecondary` and\or `--includeSupplementary` are set to false (default=false) | |
|if `--include-secondary` and\or `--include-supplementary` are set to false (default=false) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't grok 446-447 after multiple attempts at reading the usage (before looking at the code). Perhaps you don't mean "non-primary"? And if not, then I'm really confused!
|read has mapping quality < `min-map-q` (default=1), or c) the primary mappings for R1 and R2 are on different | ||
|chromosomes and `--allow-inter-contig` has been set to false. | ||
|read has mapping quality < `min-map-q` (default=1), c) the primary mappings for R1 and R2 are on different | ||
|chromosomes and `--allow-inter-contig` has been set to false., or d.) all non-primary reads are filtered, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
|chromosomes and `--allow-inter-contig` has been set to false., or d.) all non-primary reads are filtered, | |
|chromosomes and `--allow-inter-contig` has been set to false, or d.) all non-primary reads are filtered, |
@arg(flag='t', doc="The tag containing the raw UMI.") val rawTag: String = "RX", | ||
@arg(flag='T', doc="The output tag for UMI grouping.") val assignTag: String = "MI", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shall we finally update these to use
val UmiBases = "RX" |
val MolecularId = "MI" |
|
||
/** Sets the duplicate flags on all reads within all templates. */ | ||
private def setDuplicateFlags(group: Seq[Template]): Unit = { | ||
val nonDuplicateTemplate = group.minBy { template => |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we want maxBy
here? To my eye, htsjdk's duplicate score returns higher values as better:
For example, if the duplicate score is the sum of base qualities, then don't we want the template with the highest sum of base qualities?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good catch!
} | ||
|
||
group.foreach { template => | ||
val flag = !(template eq nonDuplicateTemplate) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
val flag = !(template eq nonDuplicateTemplate) | |
val flag = template ne nonDuplicateTemplate |
private def setDuplicateFlags(group: Seq[Template]): Unit = { | ||
val nonDuplicateTemplate = group.minBy { template => | ||
template.primaryReads.sumBy { r => | ||
DuplicateScoringStrategy.computeDuplicateScore(r.asSam, ScoringStrategy.SUM_OF_BASE_QUALITIES) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As a compromise, consider setting the ScoringStrategy as a constructor parameter that's not exposed on the command line?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I really don't want to expose this anywhere - I want the freedom to change this if/when we decide using HTSJDK's duplicate scoring stuff was a bad idea in the future, without breaking any APIs.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A few things in here -
1.) hard-coding this seems like a really odd choice, we could wrap it if you're concerned about leaking API.
2.) There isn't a tie break any longer so, choosing amongst == baseQs will be random, yeah? which is why mapQ was included originally. I think the tests are passing now just because of the passed in sort order.
this is more tidy logic, nice.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just as a point re duplicate scoring strategies - the core strategies for choosing the representative read, much like UMI clustering algorithms are fairly established in literature & while having others seems possible, not supporting established ones, seems like a limitation which would be self imposed, I'm happy to wrap these enums if that's a need from the projects perspective so that the HTSJDK api isn't leaking.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Lol, this did make me laugh a little. I would agree if deep thought had gone into the original duplicate scoring strategies ... but as the author of that original code (in Picard that moved into HTSJDK) I can tell you that it was me making it up on the fly and coming up with something vaguely sensible!
Over time I tend to think that the duplicate marking stuff in Picard has gotten far too complicated.
I'd also argue that there's little to no value tie-breaking on mapq. If the reads are considered duplicates of each other and have either a) the same sum of base qualities > Q15, or b) the same mapped reference length, there's a very high probability they also have the same sum of mapqs.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One options seems fine. I think it's perfectly fine that that is sum of base q, nothing about picards implementation of these scoring strategies is complex, the usages deep in the code get to be rather complex, but this simple score calculator just seems to be good code reuse to me.
I'm happy to look at how map q changes of note my testing is with bowtie2 mostly - so that certainly is impacting my perspective here.
I'll stop pushing here & just test a few more things, give some perspective with data.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @JoeVieira ... please let me know when you've had a chance to look at your data.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@tfenne it'll be a little bit before I'm able to dig back into that data. But will in the next week or so.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it would be sensible to choose the read with highest mapq and breaking ties randomly.
@JoeVieira did you have any suggestions after reviewing some data?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry all, I have no been able to review data at scale yet. It's on my list for next week i promise.
@@ -499,6 +519,10 @@ class GroupReadsByUmi | |||
|
|||
private val assigner = strategy.newStrategy(this.edits) | |||
|
|||
// Give values to unset parameters that are different in duplicate marking mode |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Coupling these seems odd to me & deviates from the intended change set.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you provide a bit more context of why it seems odd? My intuition is that if you're marking duplicates with this tool, you want to keep all reads (i.e. don't pre-filter based on mapq). Similarly, you want to set the duplicate flag on secondary and supplementary reads (and don't prefilter them) too. What am I missing?
Also, what's the "intended change set"?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Given that the tool allows for multiple versions of filtering, allowing flexibility to choose what one is filtering & not seems like it's giving appropriate control over the functionality, rather than forcing a workflow.
It's true that, if one is keeping all reads, one must update all reads, but forcing all reads to stay, when marking doesn't seem like a requirement for that. The forcing to keep all is the odd part from my perspective.
Of course one could filter reads after doing this, if so inclined, but maintaining flexibility, rather than forcing workflows seems like a functional win for the toolset.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think any option is "forced", it just has a default when the option is unset. You can always set the option and this will override the default. For example, you can explicitly set --include-supplementary=true
or --include-suppementary=false
. I think that's why the types on the command line parameters are Option[Boolean]
with default None
:
None
-> use a defaultSome(true)
-> include supplementarySome(false)
-> filter out supplementary
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fair. I'm being overly dramatic with "forced" & you're right, it can be overidden, you're the maintainers so the paradigm the tool uses is ultimately up to you.
What just strikes me as odd to change defaults when semi related parameters are passed - they aren't strictly coupled parameters, if they were strictly coupled i'd agree - i think this makes for a usage that implies a specific workflow & causes users to have to more closely read the docs if their workflow isn't that one.
As a user, when defaults change because of other parameters it can result in unexpected results - which are often hard to debug.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fair, the alternative is to provide a second command line tool that performs the duplicate marking that is just a facade over group reads by umi, but that seems a bit heavy handed. I'll let @tfenne take this discussion over, as I don't have strong feelings either way, but would lean to the current implementation as it is well-documented in the usage.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I made this change because otherwise, if one wants to use GroupReadsByUmi
as an equivalent to MarkDuplicates
in picard, one would have had to add:
--mark-duplicates --min-mapq 0 --include-secondary --include-supplementary
which to me seems error prone. I don't see the harm in defaulting arguments one way when the intended usage is for consensus calling and another when the intended usage is for duplicate marking. The user is always free to set them explicitly.
@tfenne appreciate you tackling it. I certainly would have preferred feedback rather than this, but i'm happy it's getting attention - this introduces some oddities that were not present in my change set, while a couple of suggested edits on my PR would have gotten your point of using groupby instead of sorting across just fine. |
Co-authored-by: Nils Homer <[email protected]>
private def setDuplicateFlags(group: Seq[Template]): Unit = { | ||
val nonDuplicateTemplate = group.maxBy { template => | ||
template.primaryReads.sumBy { r => | ||
DuplicateScoringStrategy.computeDuplicateScore(r.asSam, ScoringStrategy.SUM_OF_BASE_QUALITIES) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
DuplicateScoringStrategy.computeDuplicateScore(r.asSam, ScoringStrategy.SUM_OF_BASE_QUALITIES) | |
r.mapq |
Another thing to consider here - which I was planning on layering in after this PR is also allowing unmapped reads through this as well. ( which will necessitate an adjustment to CallConcensusReads also ) Removing the filtering & ensuring these reads get skipped for UMI grouping & just passed through would be how I was considering doing this. |
Added filtering of unmapped reads and reads with unmapped mates in the consensus calling iterator.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. Until we have a second duplicate scoring strategy that is not random, I agree that setting to sum of base qualities from htsjdk is fine. It is also a proxy for aligned length, as well as read length, which can vary even when the outer coordinates are the same.
Aside: I would agree that "by mapping quality" is effectively random, since read pairs that are duplicates of each other are almost surely aligned the same by virtue of being duplicates of each other and therefore having the same mapq by current aligners.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. I did add a commit (I hope that's ok) to make it work with codecov (which now require s a token, even for public repos)
@JoeVieira and @nh13 this is my attempt to provide a cleaned up version of #934. I figured this was easier than trying to shepherd the original PR through with lots of comments - though I greatly appreciate you making that PR Joe as it showed me how small the changes actually are and motivated me to do this.
Some of the changes stylistic / scala idiom stuff. Functional changes include:
--mark-duplicates
is providedEdited to add: Oh, and also changing the code downstream to avoid the consensus callers seeing secondary/supplementary reads should a BAM generated with
--mark-duplicates
get sent downstream.